Sulfobacillus acidophilus Norris et al. 1996 is a member of the genus Sulfobacillus, which
comprises five species of the order Clostridiales. Sulfobacillus species are of interest for
comparison to other sulfur and iron oxidizers and also have biomining applications. This is the
first completed genome sequence of a type strain of the genus Sulfobacillus, and the second
published genome of a member of the species S. acidophilus. The genome, which consists of
one chromosome and one plasmid with a total size of 3,557,831 bp, harbors 3,626 protein-coding
and 69 RNA genes, and is a part of the Genomic Encyclopedia of Bacteria and
Archaea project.



Introduction 



The genus Sulfobacillus currently consists of five species [1], all of which are mildly
thermophilic or thermotolerant acidophiles [2]. Sulfobacilli grow mixotrophically by
oxidizing ferrous iron, sulfur, and mineral sulfides in the presence of yeast extract or
other organic compounds [3]. Some can also grow autotrophically [2,3]. The strains that
have been tested are capable of anaerobic growth using Fe+3 as an electron acceptor [2,4].
The genus Sulfobacillus, along with the genus Thermaerobacter, has only tentatively been
assigned to a family, "Clostridiales Family XVII incertae sedis". This group may form a
deep branch within the phylum Firmicutes or may constitute a new phylum [5]. Strain NAL^T
(= DSM 10332 = ATCC 700253) is the type strain of the species Sulfobacillus acidophilus.
The genus name was derived from the Latin words 'sulfur' and 'bacillus', meaning 'small
sulfur-oxidizing rod' [6]. The species epithet is derived from the Neo-Latin words
'acidum', acid, and 'philus', loving, meaning acid-loving [3]. The first genome from a
member of the species S. acidophilus, strain TPY, which was isolated




The Genomic Standards Consortium 



Sulfobacillus acidophilus type strain (NALT) 



from a hydrothermal vent in the Pacific Ocean, was recently sequenced by Li et al. [7].
Here we present a summary classification and a set of features for S. acidophilus strain
NAL^T, together with the description of the complete genomic sequencing and annotation.

Classification and features 

A representative genomic 16S rRNA sequence of S. acidophilus NAL^T was compared using
NCBI BLAST [8,9] under default settings (e.g., considering only the high-scoring segment
pairs (HSPs) from the best 250 hits) with the most recent release of the Greengenes
database [10], and the relative frequencies of taxa and keywords (reduced to their stem
[11]) were determined, weighted by BLAST scores. The most frequently occurring genera
were Sulfobacillus (81.9%), Thermaerobacter (8.0%), Laceyella (2.8%), 'Gloeobacter'
(2.1%) and 'Synechococcus' (2.0%) (76 hits in total). Regarding the six hits to sequences
from members of the species, the average identity within HSPs was 98.9%, whereas the
average coverage by HSPs was 97.2%. Regarding the 23 hits to sequences from other members
of the genus, the average identity within HSPs was 93.1%, whereas the average coverage by
HSPs was 81.2%. Among all other species, the one yielding the highest score was
"Sulfobacillus yellowstonensis" (AY007665), which corresponded to an identity of 99.4%
and an HSP coverage of 97.0%. (Note that the Greengenes database uses the INSDC (=
EMBL/NCBI/DDBJ) annotation, which is not an authoritative source for nomenclature or
classification.) The highest-scoring environmental sequence was HQ730681 ('Microbial
Anaerobic Sediments Tinto River: Natural Acid and Heavy Metals Content extreme acid clone
SNl 2009 12D'), which showed an identity of 94.5% and an HSP coverage of 99.0%. The most
frequently occurring keywords within the labels of all environmental samples which
yielded hits were 'acid' (4.8%), 'soil' (4.5%), 'hydrotherm' (3.7%), 'microbi' (3.7%)
and 'mine' (3.0%) (172 hits in total). These keywords correspond well to the environment
from which strain NAL^T was isolated. Environmental samples that yielded hits of a higher
score than the highest scoring species were not found.
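
The score-weighted frequencies described above can be illustrated with a minimal Python sketch. It assumes that each hit contributes its BLAST score to the total of its genus label and that percentages are taken over the sum of all scores; the function name and the example hits are hypothetical and are not the hits obtained in this study.

```python
from collections import defaultdict

def weighted_frequencies(hits):
    """Return {label: percent}, with each hit weighted by its BLAST score.

    `hits` is an iterable of (label, score) pairs. This is an assumed
    reading of "relative frequencies ... weighted by BLAST scores".
    """
    totals = defaultdict(float)
    for label, score in hits:
        totals[label] += score
    grand_total = sum(totals.values())
    return {label: 100.0 * s / grand_total for label, s in totals.items()}

# Hypothetical (genus, score) pairs for illustration only.
example_hits = [
    ("Sulfobacillus", 2400.0),
    ("Sulfobacillus", 2100.0),
    ("Thermaerobacter", 900.0),
    ("Laceyella", 600.0),
]

freqs = weighted_frequencies(example_hits)
for genus, pct in sorted(freqs.items(), key=lambda kv: -kv[1]):
    print(f"{genus}: {pct:.1f}%")
```

Weighting by score rather than counting hits lets a few strong matches outweigh many weak ones, which is presumably why the reported percentages do not simply mirror hit counts.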

Figure 1 shows the phylogenetic neighborhood of S. acidophilus NAL^T in a 16S rRNA based
tree. The sequences of the five 16S rRNA gene copies in the genome differ from each other
by up to eight nucleotides, and differ by up to four nucleotides from the previously
published 16S rRNA sequence (AB089842), which contains two ambiguous base calls.
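
A statement such as "differ from each other by up to eight nucleotides" corresponds to the maximum pairwise mismatch count among aligned copies. The sketch below shows that calculation on short, made-up sequences; it assumes the copies are already aligned to equal length and ignores gaps and ambiguous base calls, which a real comparison would have to handle.

```python
from itertools import combinations

def mismatches(a, b):
    """Count differing positions between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(1 for x, y in zip(a, b) if x != y)

def max_pairwise_mismatches(seqs):
    """Largest mismatch count over all pairs of sequences."""
    return max(mismatches(a, b) for a, b in combinations(seqs, 2))

# Five hypothetical aligned gene copies (NOT real 16S rRNA sequences).
copies = [
    "ACGTACGTAC",
    "ACGTACGTAC",
    "ACGAACGTAC",
    "ACGAACGTTC",
    "TCGTACGTAC",
]

worst = max_pairwise_mismatches(copies)
print(worst)
```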

Cells of S. acidophilus NAL^T are rods 3.0-5.0 µm in length and 0.5-0.8 µm in width
(Table 1 and Figure 2) [3]. Cells are Gram-positive and form spherical endospores [3].
Flagella were not observed [3]. Strain NAL^T was found to grow between 28°C and 62°C with
an optimum at 48°C [35]. The upper and lower temperatures for growth were not determined
but were predicted to be 10°C and 62°C [35]. The pH range for growth was 1.6-2.3 with an
optimum at 1.8 [35]. Three strains of S. acidophilus have been found to be facultative
anaerobes that are able to use Fe+3 as an electron acceptor under anaerobic conditions
[4], but strain NAL^T was not tested in this study. Strain NAL^T can grow autotrophically
or mixotrophically by oxidizing Fe+2, sulfur, or mineral sulfides, or heterotrophically
on yeast extract [3]. S. acidophilus and other sulfobacilli have potential applications
in biomining. Strain NAL^T increased the leaching of numerous mineral sulfides [35];
however, its sensitivity to low concentrations of metals may limit its usefulness in
biomining [35].

Genome sequencing and annotation 

Genome project history 

This organism was selected for sequencing on the basis of its phylogenetic position [38],
and is part of the Genomic Encyclopedia of Bacteria and Archaea project [39]. The genome
project is deposited in the Genomes OnLine Database [18] and the complete genome sequence
is deposited in GenBank. Sequencing, finishing and annotation were performed by the DOE
Joint Genome Institute (JGI). A summary of the project information is shown in Table 2.

Growth conditions and DNA isolation 

S. acidophilus strain NAL^T, DSM 10332, was grown in DSMZ medium 709 (Acidomicrobium
medium) [40] at 45°C. DNA was isolated from 0.5-1 g of cell paste using MasterPure
Gram-positive DNA purification kit (Epicentre MGP04100) following the standard protocol
as recommended by the manufacturer with modification st/LALM for cell lysis as described
in Wu et al. 2009 [39]. DNA is available through the DNA Bank Network [41].
Figure 1. Phylogenetic tree highlighting the position of S. acidophilus relative to the type strains of the other species
within the genus Sulfobacillus. The tree was inferred from 1,422 aligned characters [12,13] of the 16S rRNA gene
sequence under the maximum likelihood (ML) criterion [14]. The comparatively closely related genus Symbiobacterium
[15] was included for rooting the tree. The branches are scaled in terms of the expected number of substitutions per
site. Numbers adjacent to the branches, if any, are support values from 1,000 ML bootstrap replicates [16] (left) and
from 1,000 maximum parsimony bootstrap replicates [17] (right) if larger than 60% (i.e., there were none). Lineages
with type strain genome sequencing projects registered in GOLD [18] are labeled with one asterisk, those also listed as
'Complete and Published' with two asterisks [19].

Figure 2. Scanning electron micrograph of S. acidophilus NAL^T
Table 1. Classification and general features of S. acidophilus NAL^T according to the MIGS
recommendations [20] and the NamesforLife database [21].

MIGS ID   Property                Term                                     Evidence code
          Current classification  Domain Bacteria                          TAS [22]
                                  Phylum "Firmicutes"                      TAS [23-25]
                                  Class Clostridia                         TAS [26,27]
                                  Order Clostridiales                      TAS [28,29]
                                  Family "XVII incertae sedis"             TAS [5,30]
                                  Genus Sulfobacillus                      TAS [31-33]
                                  Species Sulfobacillus acidophilus        TAS [3,34]
                                  Type strain NAL                          TAS [3]
          Gram stain              positive                                 TAS [3]
          Cell shape              rods                                     TAS [3]
          Motility                non-motile                               NAS
          Sporulation             spherical endospores                     TAS [3]
          Temperature range       not reported
          Optimum temperature     48°C                                     TAS [35]
          Salinity                not reported
MIGS-22   Oxygen requirement      facultative anaerobe                     TAS [4]
          Carbon source           CO2, organic compounds                   TAS [3]
          Energy metabolism       autotrophic, mixotrophic, heterotrophic  TAS [3]
MIGS-6    Habitat                 acidic sulfidic and sulfurous sites      TAS [35]
MIGS-15   Biotic relationship     free-living                              TAS [3]
MIGS-14   Pathogenicity           none                                     NAS
          Biosafety level         1                                        TAS [36]
          Isolation               coal spoil heap                          TAS [3]
MIGS-4    Geographic location     Alvecote, North Warwickshire, UK         TAS [3]
MIGS-5    Sample collection time  1988                                     TAS [3]
MIGS-4.1  Latitude                52.638                                   TAS [3]
MIGS-4.2  Longitude               -1.641                                   TAS [3]
MIGS-4.3  Depth                   not reported
MIGS-4.4  Altitude                not reported

Evidence codes - IDA: Inferred from Direct Assay (first time in publication); TAS: Traceable Author
Statement (i.e., a direct report exists in the literature); NAS: Non-traceable Author Statement (i.e., not
directly observed for the living, isolated sample, but based on a generally accepted property for the
species, or anecdotal evidence). These evidence codes are from the Gene Ontology project [37]. If the
evidence code is IDA, then the property was directly observed for a living isolate by one of the
authors or an expert mentioned in the acknowledgements.
Table 2. Genome sequencing project information

MIGS ID    Property                    Term
MIGS-31    Finishing quality           Finished
MIGS-28    Libraries used              Four genomic libraries: one 454 pyrosequence standard library,
                                       two 454 PE libraries (6 kb and 10 kb insert size), one Illumina library
MIGS-29    Sequencing platforms        Illumina GAii, 454 GS FLX Titanium
MIGS-31.2  Sequencing coverage         168.4 × Illumina; 51.2 × pyrosequence
MIGS-30    Assemblers                  Newbler version 2.3-PreRelease-6/30/2009, Velvet 1.0.13,
                                       phrap version SPS - 4.24
MIGS-32    Gene calling method         Prodigal 1.4, GenePRIMP
           INSDC ID                    CP003179 (chromosome)
                                       CP003180 (plasmid, unnamed)
           Genbank Date of Release     December 14, 2011
           GOLD ID                     Gc02053
           NCBI project ID             40777
           Database: IMG-GEBA          2506520015
MIGS-13    Source material identifier  DSM 10332
           Project relevance           Tree of Life, GEBA, biomining



Genome sequencing and assembly 

The genome was sequenced using a combination of Illumina and 454 sequencing platforms.
All general aspects of library construction and sequencing can be found at the JGI
website [42]. Pyrosequencing reads were assembled using the Newbler assembler (Roche).
The initial Newbler assembly, consisting of 104 contigs in three scaffolds, was converted
into a phrap [43] assembly by making fake reads from the consensus, to collect the read
pairs in the 454 paired end library. Illumina GAii sequencing data (599.7 Mb) were
assembled with Velvet [44] and the consensus sequences were shredded into 1.5 kb
overlapped fake reads and assembled together with the 454 data. The 454 draft assembly
was based on 143.7 Mb of 454 draft data and all of the 454 paired-end data. Newbler
parameters were -consed -a 50 -l 350 -g -m -ml 20. The Phred/Phrap/Consed software
package [43] was used for sequence assembly and quality assessment in the subsequent
finishing process. After the shotgun stage, reads were assembled with parallel phrap
(High Performance Software, LLC). Possible mis-assemblies were corrected with
gapResolution (C. Han, unpublished), Dupfinisher [45], or sequencing cloned bridging PCR
fragments with subcloning. Gaps between contigs were closed by editing in Consed, PCR and
Bubble PCR primer walks (J.-F. Chang, unpublished). A total of 640 additional reactions
and eight shatter libraries were necessary to close gaps and to raise the quality of the
finished sequence. Illumina reads were also used to correct potential base errors and
increase consensus quality using the software Polisher developed at JGI [46]. The error
rate of the completed genome sequence is less than 1 in 100,000. Together, the
combination of the Illumina and 454 sequencing platforms provided 219.6 × coverage of
the genome. The final assembly contained 612,059 pyrosequence and 16,626,072 Illumina
reads.
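
The coverage figures can be cross-checked with simple arithmetic. The sketch below adds the two per-platform values reported in Table 2 and, as a rough plausibility check, divides the stated Illumina data volume by the total genome size. The assumption that coverage equals sequenced bases divided by genome length is ours; the exact calculation used for the reported values is not described in the text.

```python
GENOME_SIZE_BP = 3_557_831      # chromosome plus plasmid, from the abstract
ILLUMINA_BASES = 599.7e6        # "599.7 Mb" of Illumina GAii data

def coverage(total_bases, genome_size):
    """Fold coverage under the simple model: bases sequenced / genome length."""
    return total_bases / genome_size

# Per-platform coverage as reported in Table 2.
illumina_reported = 168.4
pyrosequence_reported = 51.2

combined = illumina_reported + pyrosequence_reported
illumina_estimate = coverage(ILLUMINA_BASES, GENOME_SIZE_BP)

print(f"combined reported coverage: {combined:.1f} x")
print(f"Illumina estimate from data volume: {illumina_estimate:.1f} x")
```

The sum reproduces the 219.6 × stated in the text, and the estimate from the data volume lands close to, though not exactly on, the reported Illumina value.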

Genome annotation 

Genes were identified using Prodigal [47] as part of the Oak Ridge National Laboratory
genome annotation pipeline, followed by a round of manual curation using the JGI
GenePRIMP pipeline [48]. The predicted CDSs were translated and used to search the
National Center for Biotechnology Information (NCBI) nonredundant database, UniProt,
TIGR-Fam, Pfam, PRIAM, KEGG, COG, and InterPro databases. Additional gene prediction
analysis and functional annotation was performed within the Integrated Microbial
Genomes - Expert Review (IMG-ER) platform [49].

Genome properties 

The genome consists of one circular chromosome of 3,472,898 bp and one circular plasmid
of 84,933 bp length with an overall G+C content of 56.8% (Table 3 and Figures 3 and 4).
Based on coverage of 454 paired ends, the plasmid may be inserted into the chromosome in
about half of the population. Of the 3,695 genes predicted, 3,626 are protein-coding
genes, and 69 are RNAs; 155 pseudogenes were also identified. The majority of the
protein-coding genes (68.3%) were assigned a putative function while the remaining ones
were annotated as hypothetical proteins. The distribution of genes into COGs functional
categories is presented in Table 4.
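
As a quick consistency check on the numbers in this paragraph, the following sketch sums the replicon sizes and gene counts and compares them with the totals given in the text. The plasmid share of the genome is derived here for illustration and is not a figure reported by the authors.

```python
# Values taken from the text above.
chromosome_bp = 3_472_898
plasmid_bp = 84_933
protein_coding_genes = 3_626
rna_genes = 69

total_bp = chromosome_bp + plasmid_bp
total_genes = protein_coding_genes + rna_genes

# Derived for illustration only: fraction of the genome carried by the plasmid.
plasmid_share_pct = 100.0 * plasmid_bp / total_bp

print(f"total genome size: {total_bp:,} bp")
print(f"total genes: {total_genes:,}")
print(f"plasmid share of genome: {plasmid_share_pct:.2f}%")
```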
Table 3. Genome Statistics

Attribute                         Value        % of Total(a)
[illegible]                       [illegible]  [illegible]
Genes assigned to COGs            2,740        75.57%
Genes assigned Pfam domains       413          11.39%
Genes with signal peptides        652          17.98%
Genes with transmembrane helices  910          25.10%
CRISPR repeats                    2

a) The total is based on either the size of the genome in base pairs or
the total number of protein coding genes in the annotated genome.



Insights into the genome sequence 
Comparative genomics 

While the sequencing of the genome described in this paper was underway, Li et al. from
the Third Institute of Oceanography, Xiamen, China published the complete genome sequence
of strain TPY [7]. The two genomes differ in size by less than 7,000 bp. Here, we take
the opportunity to compare the completed genome sequences from these two strains, NAL^T
and TPY, both belonging to S. acidophilus. While the biological material for the type
strain, NAL^T, is publicly available from the DSMZ open collection for postgenomic
analyses, no source of the biological material (MIGS-13 criterion, see Table 2) of strain
TPY was provided by Li et al. [7].

To estimate the overall similarity between the genomes of strains NAL^T and TPY (GenBank
accession number: CP002901), the GGDC-Genome-to-Genome Distance Calculator [50,51] was
used. The system calculates the distances by comparing the genomes to obtain HSPs
(high-scoring segment pairs) and inferring distances from three formulae (HSP length /
total length; identities / HSP length; identities / total length). The comparison of the
genomes of strains NAL^T and TPY revealed that 99.65% of the average of the genome
lengths are covered with HSPs. The identity within these HSPs was 99.01%, whereas the
identity over the whole genome (counting regions not covered by HSPs as non-identical)
was 98.67%. The inferred digital DNA-DNA hybridization values for the two strains are
96.47% (formula 1 in [51]), 86.08% (formula 2 in [51]) and 97.05% (formula 3 in [51]),
respectively. These results clearly demonstrate that according to the whole genome
sequences of strains NAL^T and TPY, the similarity is very high, supporting the
membership of both strains in the same species.
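
The three ratios named above can be sketched in a few lines of Python. This is a simplified illustration, not the GGDC implementation: it assumes non-overlapping HSPs, takes "total length" to be the average of the two genome lengths (as the text's wording suggests), and uses a hypothetical `HSP` record with made-up values. It does not reproduce the conversion of these ratios into digital DNA-DNA hybridization values.

```python
from dataclasses import dataclass

@dataclass
class HSP:
    """Hypothetical record for one high-scoring segment pair."""
    length: int       # aligned length of the HSP
    identities: int   # identical positions within the HSP

def similarity_ratios(hsps, genome_len_a, genome_len_b):
    """Compute the three ratios listed in the text (simplified sketch)."""
    total_length = (genome_len_a + genome_len_b) / 2  # assumed definition
    hsp_length = sum(h.length for h in hsps)
    identities = sum(h.identities for h in hsps)
    return {
        "hsp_length/total_length": hsp_length / total_length,
        "identities/hsp_length": identities / hsp_length,
        "identities/total_length": identities / total_length,
    }

# Made-up toy data: two 1,000 bp "genomes" and two HSPs.
ratios = similarity_ratios([HSP(600, 594), HSP(300, 297)], 1000, 1000)
for name, value in ratios.items():
    print(f"{name}: {value:.4f}")

# The third ratio is the product of the first two. Applied to the reported
# values, 0.9965 * 0.9901 gives about 0.9866, in line with the reported
# 98.67% once rounding of the inputs is taken into account.
product_of_reported = 0.9965 * 0.9901
print(f"{product_of_reported:.4f}")
```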
Figure 3. Graphical map of the chromosome. From outside to the center: Genes on forward
strand (colored by COG categories), Genes on reverse strand (colored by COG categories), 
RNA genes (tRNAs green, rRNAs red, other RNAs black), GC content, GC skew. 




Figure 4. Graphical map of the plasmid. From outside 
to the center: Genes on forward strand (colored by 
COG categories), Genes on reverse strand (colored by 
COG categories), RNA genes (tRNAs green, rRNAs 
red, other RNAs black), GC content, GC skew. 
Table 4. Number of genes associated with the general COG functional categories

Code  Value  %age(a)  Description
J     149    4.1      Translation, ribosomal structure and biogenesis
A     0      0.0      RNA processing and modification
K     188    5.2      Transcription
L     269    7.4      Replication, recombination and repair
B     1      0.0      Chromatin structure and dynamics
D     26     0.7      Cell cycle control, cell division, chromosome partitioning
Y     0      0.0      Nuclear structure
V     34     0.9      Defense mechanisms
T     111    3.1      Signal transduction mechanisms
M     149    4.1      Cell wall/membrane/envelope biogenesis
N     47     1.3      Cell motility
Z     0      0.0      Cytoskeleton
W     0      0.0      Extracellular structures
U     62     1.7      Intracellular trafficking, secretion, and vesicular transport
O     129    3.6      Posttranslational modification, protein turnover, chaperones
C     244    6.7      Energy production and conversion
G     215    5.9      Carbohydrate transport and metabolism
E     257    7.1      Amino acid transport and metabolism
F     89     2.5      Nucleotide transport and metabolism
H     153    4.2      Coenzyme transport and metabolism
I     130    3.6      Lipid transport and metabolism
P     121    3.3      Inorganic ion transport and metabolism
Q     81     2.2      Secondary metabolites biosynthesis, transport and catabolism
R     326    9.0      General function prediction only
S     239    6.6      Function unknown
-     886    24.4     Not in COGs

a) The percentage is based on the total number of protein coding genes in the annotated
genome.
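
The percentages in Table 4 follow from dividing each count by the 3,626 protein-coding genes, as the footnote states. The sketch below recomputes a handful of rows and derives the number of genes with a COG assignment from the "Not in COGs" row; the selection of rows is arbitrary and the helper name `pct` is ours.

```python
PROTEIN_CODING_GENES = 3_626  # denominator named in the table footnote

# A few rows copied from Table 4 (code -> gene count).
cog_counts = {
    "J": 149,
    "K": 188,
    "L": 269,
    "R": 326,
    "S": 239,
    "not_in_COGs": 886,
}

def pct(count, total=PROTEIN_CODING_GENES):
    """Percentage of protein-coding genes, rounded to one decimal place."""
    return round(100.0 * count / total, 1)

percentages = {code: pct(n) for code, n in cog_counts.items()}
genes_with_cog = PROTEIN_CODING_GENES - cog_counts["not_in_COGs"]

for code, value in percentages.items():
    print(f"{code}: {value}%")
print(f"genes with at least one COG assignment: {genes_with_cog}")
```

The derived count matches the 2,740 genes with COGs assigned that are reported for the type strain. The per-category counts add up to more than that number, which is consistent with some genes being assigned to more than one category.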



The comparison of the number of genes belonging to the different COG categories revealed
few differences between the genomes of strains NAL^T and TPY. Strain NAL^T has 2,740
genes with COGs assigned, while strain TPY has 2,700. We analyzed the differences in COG
assignment between the two strains and found that in almost all cases they could be
explained by differences in the gene calls or pseudogene assignment, i.e. in one genome
two parts of a pseudogene were called as two separate genes, while in the other genome
they were combined into one pseudogene. The only clear case of a difference in gene
content between the two strains is the presence of a transposable element consisting of
two genes (Sulac_1668, Sulac_1669) disrupting a subunit of a potassium transporter
(Sulac_1667) in strain NAL^T. There were also cases where a gene in one strain was split
into two genes in the other strain. For example, Sulac_2178 corresponds to TPY_1983 and
TPY_1984, and Sulac_0347 corresponds to TPY_0381 and TPY_0382. In both cases the
differences are due to a single base indel.



A dot plot showed that there are large blocks of synteny between the two genomes with
some rearrangements (data not shown). The genes found on the plasmid in strain NAL^T are
found in two regions of the chromosome in strain TPY. Sulac_3528-3555 corresponds to
TPY_0524-0552, while Sulac_3556-3626 corresponds to TPY_2310-2244. This suggests that in
strain TPY, the plasmid was inserted into the chromosome and then split into two pieces.
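
The locus-tag ranges quoted above can be summarized programmatically. The sketch below assumes that locus tags within each range are numbered consecutively, so that the span of a range approximates its gene count; this is an assumption about the annotation, not something stated in the text. A descending range is read as the block lying in reverse orientation.

```python
def span(first, last):
    """Number of consecutive locus-tag numbers covered by a range."""
    return abs(last - first) + 1

def orientation(first, last):
    """'forward' if tag numbers ascend along the block, else 'reverse'."""
    return "forward" if last >= first else "reverse"

# (type-strain range, TPY range) as locus-tag numbers from the text.
blocks = [
    ((3528, 3555), (524, 552)),
    ((3556, 3626), (2310, 2244)),
]

summary = []
for (s_first, s_last), (t_first, t_last) in blocks:
    summary.append({
        "sulac_span": span(s_first, s_last),
        "tpy_span": span(t_first, t_last),
        "tpy_orientation": orientation(t_first, t_last),
    })

for row in summary:
    print(row)
```

The spans of corresponding blocks are close but not equal, which would fit the differences in gene calls between the two annotations noted earlier in this section.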

We analyzed CRISPR repeats with the CRISPR Recognition Tool [52] and found major
differences between the two strains. They both have two regions of CRISPR repeats, but
the strain TPY repeat regions have 8 and 9 repeats while the strain NAL^T repeat regions
have 27 and 43 repeats. All of the spacers in the TPY repeat regions are found in NAL^T,
but NAL^T has many additional spacers. This agrees with previous results suggesting that
CRISPRs evolve quickly, and differences can be found in closely related strains [53].
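
The spacer comparison amounts to a subset test between two collections of spacer sequences. The sketch below shows that test on invented spacers; it assumes spacers are compared as exact strings in a common orientation and does not parse the output of the CRISPR Recognition Tool.

```python
def compare_spacers(spacers_a, spacers_b):
    """Return (a_is_subset_of_b, spacers unique to b) for two spacer sets."""
    set_a, set_b = set(spacers_a), set(spacers_b)
    return set_a <= set_b, set_b - set_a

# Invented spacer sequences for illustration only.
tpy_spacers = ["ACGTTGCA", "GGATCCAA"]
type_strain_spacers = ["ACGTTGCA", "GGATCCAA", "TTGACCGT", "CATGCATG"]

all_shared, extra = compare_spacers(tpy_spacers, type_strain_spacers)
print(all_shared)
print(sorted(extra))
```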